class: title-slide title-wrangling center middle # Advanced R for Econometricians ## Data Wrangling with `dplyr` and `tidyr` --- class: top, left ## Tibble - Tibbles are the main data structure used in the `tidyverse` and can be viewed as a modern version of the data frame. - `tibble()` creates a new tibble from scratch. ```r library(tidyverse) new_tibble <- tibble(x = 1:3, y = letters[1:3]) ``` - `as_tibble()` creates a tibble from an existing object (e.g. a data frame or a matrix). ```r a_data.frame <- data.frame(x = 1:3, y = letters[1:3]) a_tibble <- as_tibble(a_data.frame) ``` - Everything that works for data frames also works with tibbles because ```r class(a_tibble) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` --- class: top, left ## Tibble vs. data.frame - Why should we use tibbles instead of data frames? See the following comparison. ```r data(mtcars) mtcars_tibble <- as_tibble(mtcars) ``` - different print method ```r print(mtcars) print(mtcars_tibble) ``` - more consistent subsetting behavior ```r mtcars[ ,1:2] # returns a data.frame mtcars[ ,1] # returns a vector mtcars_tibble[ ,1:2] # returns a tibble mtcars_tibble[ ,1] # returns a tibble ``` --- ## Tibble vs. data.frame - lazy and sequential evaluation ```r data.frame(x = 1:3, y = x^2) # not working tibble(x = 1:3, y = x^2) ``` - does not change input types ```r text <- letters[1:3] class(text) df <- data.frame(text) # stringsAsFactors = TRUE is the default class(df$text) tib <- tibble(text) class(tib$text) ``` ``` ## [1] "character" ## [1] "character" ## [1] "character" ``` - For more see `vignette("tibble")`. --- class: top, left ## dplyr The `dplyr` package is the most important package of the `tidyverse` when it comes to data manipulation and transformation. The core characteristics of the package: - focus on data frames (data frame in → data frame out) - verbose compared to base R which leads to longer but easier to read code - consistent syntax. You can compare the functionality to that of a relational database which provides methods to work with single tables and multiple tables. --- class: top, left ## Single Table Functions First, we look at functions that work on a single table. - `filter()` to select rows based on their values. ```r filter(mtcars_tibble, hp >= 100 & mpg >= 20) ``` - `slice()` to select rows by position. ```r slice(mtcars_tibble, 10:15) ``` - `select()`/`rename()` to select/rename variables based on their names. ```r select(mtcars_tibble, cyl, hp) ``` <!--- old slide --- class: top, left ## Single Table Functions - `arrange()` to reorder rows. ```r arrange(mtcars_tibble, cyl) arrange(mtcars_tibble, desc(cyl)) ``` - `mutate()` and `transmute()` to add new variables that are functions of existing variables. ```r mutate(mtcars_tibble, gpm = 1/mpg) transmute(mtcars_tibble, gpm = 1/mpg) ``` - `summarise()` to condense multiple values to a single value. ```r summarise(mtcars_tibble, mean_hp = mean(hp), max_mpg = max(mpg)) ``` ---> --- class: top, left ## Single Table Functions - `arrange()` to reorder rows. ```r arrange(mtcars_tibble, cyl) arrange(mtcars_tibble, desc(cyl)) ``` - `relocate()` to reorder columns (variables). ```r relocate(mtcars_tibble, gear) relocate(mtcars_tibble, gear, .after = last_col()) relocate(mtcars_tibble, gear, .before = mpg) ``` - `mutate()` and `transmute()` to add new variables that are functions of existing variables. ```r mutate(mtcars_tibble, gpm = 1/mpg) transmute(mtcars_tibble, gpm = 1/mpg) ``` - `summarise()` to condense multiple values to a single value. ```r summarise(mtcars_tibble, mean_hp = mean(hp), max_mpg = max(mpg)) ``` --- class: top, left ## Single Table Functions - `sample_n()` and `sample_frac()` to take random samples. ```r sample_n(mtcars_tibble, 10) sample_frac(mtcars_tibble, 0.1) ``` - `distinct()` removes duplicate rows. ```r tibble_with_duplicates <- tibble(x = c(1,1,2,3), y = c(4,4,5,6)) tibble_without_duplicates <- distinct(tibble_with_duplicates) ``` - `count()` and `tally()` count rows. ```r count(mtcars_tibble); count(mtcars_tibble, cyl) tally(mtcars_tibble); tally(mtcars_tibble, cyl) ``` --- class: top, left ## Filter The function `filter()` selects rows based on their values. - Use comparison operators `==`, `>`, `>=`, `<`, `<=`, and `!=` to filter rows based on a condition. ```r filter(mtcars_tibble, hp == 100) ``` - Use logical operators such as `&` and `|` to combine multiple conditions. ```r filter(mtcars_tibble, hp > 100 & cyl == 4) filter(mtcars_tibble, hp > 100 | cyl == 4) ``` --- - Additional arguments to `filter()` are combined with `&` which is the same as the above. ```r filter(mtcars_tibble, hp > 100, cyl == 4) ``` - For the common operation `x >= left & x <= right` there is the shorthand `between(x, left, right)`. ```r filter(mtcars_tibble, x >= 90 & x <= 110) filter(mtcars_tibble, between(hp, 90, 110) ``` --- ## Select The function `select()` selects columns based on their name. - The function can handle standard and non-standard evaluation (more about this later). ```r select(mtcars_tibble, disp, vs) select(mtcars_tibble, "disp", "vs") ``` - You can also work with character vectors. ```r select(mtcars_tibble, c("disp", "vs")) ``` - Use `-` to select all except the provided columns. ```r select(mtcars_tibble, -disp, -vs) select(mtcars_tibble, -c("disp", "vs")) ``` - There is a whole bunch of helper functions for `select()` on the next slides. --- class: top, left ## Helper Functions for `select()` - Select columns whose name contains a character string. ```r select(iris, contains(".")) ``` - Select columns whose name ends with a character string. ```r select(iris, ends_with("Length")) ``` - Select columns whose name matches a regular expression. ```r select(iris, matches(".t.")) ``` - Select columns named `x1`, `x2`, `x3`, `x4`, `x5`. ```r select(iris, num_range("x", 1:5)) ``` --- class: top, left ## Helper Functions for `select()` - Select all columns from `Sepal.Width` to `Petal.Width`. ```r select(iris, Sepal.Width:Petal.Width) ``` - Select columns whose names are in a group of names. ```r select(iris, one_of(c("Species", "Genus"))) ``` - Select columns whose name starts with a character string. ```r select(iris, starts_with("Sepal")) ``` --- ## Mutate The function `mutate()` adds new columns or replaces old columns based on existing columns. ```r mutate(mtcars_tibble, gpm = 1/mpg, disp = disp - mean(disp) # replaces disp ) ``` - Use `lag()` and `lead()` to access values that are not in the same row (e.g. for time series data). ```r rw <- tibble(t = 1:10, y = cumsum(rnorm(10))) mutate(rw, increments = y - lag(y, k = 1)) ``` - Use `min_rank()` to add a column with the rank (`?min_rank()` gives you an overview of further ranking functions). ```r mutate(mtcars_tibble, hp_rank_smallest_first = min_rank(hp), hp_rank_largest_first = min_rank(desc(hp)) ) ``` --- ## Summarise `summarise()` can be used to collapse columns to a single row used with e.g.: - base R functions such as `mean()`, `sd()`, `max()` - dplyr specific functions for - position `first()`, `nth()`, `last()` - counts `n()`, `n_disticnt()` ```r summarise(mtcars_tibble, mean_hp = mean(hp), max_wt = max(wt), n = n(), third_cyl = nth(cyl, 3) ) ``` - functions you have written yourself ```r my_variance <- function(x) 1/(n()-1) * sum((x - mean(x))^2) summarise(by_cyl, my_variance(mpg)) ``` --- ## Summarise - You cannot use `summarise()` with functions that return more than one value per group. ```r two_return_vaules <- function(x) c(mean(x), sd(x)) summarise(by_cyl, two_return_vaules(mpg)) ``` - the function can take multiple variable or fixed inputs. ```r two_input_values <- function(x, y) sum(x + y) summarise(by_cyl, two_input_values(hp, qsec)) # variable inputs summarise(by_cyl, two_input_values(hp, 1)) # fixed input ``` --- class: top, left ## Exercises <ol start="1"> <li> Load the `nycflights13::flights` data set. Does it come formatted as a tibble? </li> <li> Find all flights that </li> <ul> <li> Had an arrival delay of two or more hours.</li> <li> Flew to Houston (IAH or HOU).</li> <li> Were delayed by at least an hour, but made up over 30 minutes in flight.</li> <li> Departed between midnight and 6am.</li> </ul> <li> Which flights took the longest and which the shortest? </li> <li> Brainstorm as many ways as possible to select `dep\text{_}time`, `dep\text{_}d elay`, `arr\text{_}time`, and `arr\text{_}d elay`. </li> <li> Create a new variable `first\text{_}flight` which takes the value `TRUE` if it is the first flight of the day. </li> </ol> --- class: top, left ## Grouping To exploit the full potential of `summarise()` we pair it with `group_by()`. ```r by_cyl <- group_by(mtcars_tibble, cyl) print(summarise(by_cyl, mean(mpg))) ``` ``` ## # A tibble: 3 x 2 ## cyl `mean(mpg)` ## <dbl> <dbl> ## 1 4 26.7 ## 2 6 19.7 ## 3 8 15.1 ``` Use `ungroup()` to remove the grouping, when you want to work on the whole data set again. --- class: top, left ## Grouping It is also possible to use `group_by()` together with a lot of other single table operations. ```r by_cyl <- group_by(mtcars_tibble, cyl) # group mean is used for centering mutate(by_cyl, centered_mpg = mpg - mean(mpg)) transmute(by_cyl, centered_mpg = mpg - mean(mpg)) # return only first 3 rows for each group slice(by_cyl, 1:3) # counts the number of rows for each group count(by_cyl) # returns all groups with more than 10 rows filter(by_cyl, n() > 10) # sorts one group after the other arrange(by_cyl, disp, .by_group = TRUE) ``` --- class: top, left ## The Pipe Usually, we combine multiple operations, e.g. first we filter, then we mutate, group and finally summarise. We could do it like this: ```r filtered <- filter(mtcars_tibble, cyl <= 8) mutated <- mutate(filtered, gpm = 1/mpg) grouped <- group_by(mutated, gear) summarized <- summarise(grouped, count = n(), mean_hp = mean(hp), mean_gpm = mean(gpm)) ``` This code has the disadvantage that we need to save intermediate results that we do not care about. --- class: top, left ## The Pipe We could nest all those functions to get rid of the intermediate result. ```r summarise(group_by(mutate(filter(mtcars_tibble, cyl <= 8), gpm = 1/mpg)), count = n(), mean_hp = mean(hp), mean_gpm = mean(gpm)) ``` ``` ## # A tibble: 1 x 3 ## count mean_hp mean_gpm ## <int> <dbl> <dbl> ## 1 32 147. 0.0542 ``` However, readability gets worse and debugging will become much harder. --- class: top, left ## The Pipe That is where the pipe operator `%>%` from the `magrittr` package comes in handy. The value on the left side of the operator becomes the input to the function on the right side. ```r mtcars_tibble %>% mutate(gpm = 1/mpg) %>% group_by(gear) %>% summarise(count = n(), mean_hp = mean(hp), mean_gpm = mean(gpm)) ``` ``` ## # A tibble: 3 x 4 ## gear count mean_hp mean_gpm ## <dbl> <int> <dbl> <dbl> ## 1 3 15 176. 0.0650 ## 2 4 12 89.5 0.0424 ## 3 5 5 196. 0.0504 ``` The pipe operator is not limited to `dplyr` or the `tidyverse` ```r 1:6 %>% mean ``` --- class: top, left ## Scoped Functions We often perform operations on all or several columns at once. In this case the standard functions are pedestrian. Most helpful are the scoped versions of `summarise()`, `mutate()` and `select()`. - Use `summarise_all()`to summarise every column. ```r mtcars_tibble %>% summarise_all(mean) ``` - Use `summarise_if()` to logically subset the columns to be summarised. ```r iris %>% summarise_if(is.numeric, mean) ``` - Use `summarise_at()` to summarise selected columns. ```r iris %>% summarise_at(c("Sepal.Length", "Sepal.Width"), mean) ``` --- class: top, left ## Advanced Topics If `summarise()` is too limited, because it can only handle functions that have a single return value, use - `group_modify(.tbl, .f, ...)` to return a grouped tibble. In that case `.f` must return a data frame. - `group_map(.tbl, .f, ...)` to returns a list of results from calling `.f` on each group. - `group_walk(.tbl, .f, ...)` to call `.f` for side effects and returns the input `.tbl`, invisibly. Those functions are closely related to the functions in the `purrr` package which we will talk about later. Here we only give a short example of how they can be used. .code80[ ```r mtcars %>% group_by(cyl) %>% group_map(~ lm(mpg ~ hp, data = . )) mtcars %>% group_by(cyl) %>% group_modify(~ broom::tidy(lm(mpg ~ hp, data = . ))) par(mfrow = c(2,2)) mtcars %>% group_by(cyl) %>% group_walk(~ plot(mpg ~ hp, data = . )) ``` ] --- class: top, left ## Exercises Use dplyr on the Boston crime data to: <ol start="6"> <li> Find out in which district the most incident where reported. </li> <li> Create an ordered list of the top ten districts according to the number of drug violations. Start with the district with most drug violations. </li> <li> What was the most dangerous hour of the week according to the number of homicides (`OFFENSE\text{_}CODE\text{_}GROUP == "Homicide"`)? </li> <li> Calculate for each district the proportion of motor vehicle accidents to all reported incidents (`OFFENSE\text{_}CODE\text{_}GROUP == "Motor Vehicle Accident Response"`). </li> <li> Sort the data set chronologically by year and month. Then take the first 10.000 observations and compute the proportion of incidents where a shooting was involved for each district. Now take a random sample of size 10.000. Is there a noteworthy difference? **Hint:** There is a variable called `SHOOTING`. </li> </ol> --- class: top, left ## Functions for 2 Tables ```r band_members ``` ``` ## # A tibble: 3 x 2 ## name band ## <chr> <chr> ## 1 Mick Stones ## 2 John Beatles ## 3 Paul Beatles ``` ```r band_instruments ``` ``` ## # A tibble: 3 x 2 ## name plays ## <chr> <chr> ## 1 John guitar ## 2 Paul bass ## 3 Keith guitar ``` --- class: top, left ## Left Join A left join keeps all observations from the left table, and matches records from the right table based on one (or multiple) columns (`by`). ```r band_members %>% left_join(band_instruments, by = "name") ``` ``` ## # A tibble: 3 x 3 ## name band plays ## <chr> <chr> <chr> ## 1 Mick Stones <NA> ## 2 John Beatles guitar ## 3 Paul Beatles bass ``` --- class: top, left ## Inner Join An inner join matches rows from the left and the right table but only keeps the rows where both tables have matching rows in the `by` column. ```r band_members %>% inner_join(band_instruments, by = "name") ``` ``` ## # A tibble: 2 x 3 ## name band plays ## <chr> <chr> <chr> ## 1 John Beatles guitar ## 2 Paul Beatles bass ``` --- class: top, left ## Full Join A full (outer) join matches rows from the left and the right table and keeps all observations ```r band_members %>% full_join(band_instruments, by = "name") ``` ``` ## # A tibble: 4 x 3 ## name band plays ## <chr> <chr> <chr> ## 1 Mick Stones <NA> ## 2 John Beatles guitar ## 3 Paul Beatles bass ## 4 Keith <NA> guitar ``` --- class: top, left ## Semi Join A semi join keeps all observations of the left table where there are matching values in the right table. It keeps only columns from the left. ```r band_members %>% semi_join(band_instruments, by = "name") ``` ``` ## # A tibble: 2 x 2 ## name band ## <chr> <chr> ## 1 John Beatles ## 2 Paul Beatles ``` --- class: top, left ## Anti Join An anti join returns all rows from the first table where no matches are found in the second table. ```r band_members %>% anti_join(band_instruments, by = "name") ``` ``` ## # A tibble: 1 x 2 ## name band ## <chr> <chr> ## 1 Mick Stones ``` --- class: top, left ## Reshaping Data with tidy R The goal of `tidyr` is to help you create tidy data. Tidy data is data where: - each variable is in a column - each observation is a row - each value is a cell. We will talk about - `gather()` takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. - `spread()` takes two columns (key & value), and spreads into multiple columns: it makes “long” data wider. - `separate()` takes a single column and turns it into multiple columns based on a separator. - `unite()` takes multiple columns and pastes them together into one column. --- class: top, left ## Tidy Data ```r tidyr::table1 ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` Each observation is in one row and each variable is one column. --- class: top, left ## Untidy data ```r tidyr::table2 ``` ``` ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` Observations are scattered across multiple rows --- class: top, left ## Spreading The function `tidyr::spread()` takes as arguments - `key`, containing the variable names - `value`, containing the values for the variables in the `key` column. ```r table2 %>% spread(key = type, value = count) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- ## Untidy Data ```r tidyr::table4b ``` ``` ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` - Columns do not contain variables and column names are values. --- ## Gathering The function `gather()` takes as arguments - the column names which are actual values - `key`, the name of the newly created variable resulting from the column name - `value`, the name of the newly created variable. ```r table4a %>% gather(`1999`, `2000`, key = "year", value = "cases") ``` ``` ## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766 ``` --- ## Untidy Data ```r tidyr::table3 ``` ``` ## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` - One column contains two variables. --- ## Separating The function `separate()` splits one column into multiple columns. As input it takes: - `col`, the column to be separated - `into`, the names of the newly created columns - `sep`, a regular expression which tells the function where to split (by default any non-alphanumeric character) - `convert`, a logical value which controls whether or not to run `run type.convert()` on the new columns. .code70[ ```r table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE) ``` ``` ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ] --- ## Untidy Data ```r tidyr::table5 ``` ``` ## # A tibble: 6 x 4 ## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583 ``` - One variable is spread over multiple columns. --- ## Unite The function `unite()` pastes multiple columns together. As input it takes: - `col`, the name of the new column - `...`, the columns to be pasted together - `sep`, the separator between values. ```r table5 %>% unite(year, century, year, sep = "") ``` ``` ## # A tibble: 6 x 3 ## country year rate ## <chr> <chr> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` --- ## Exercises <ol start="11"> <li> Compute the rate for `table2`, and `table4a` + `table4b`. </li> <li> Recreate the plot given by the code below using `table2` instead of `table1`. </li> </ol> ```r library(ggplot2) ggplot(table1, aes(year, cases)) + geom_line(aes(group = country), colour = "grey50") + geom_point(aes(colour = country)) ```